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Abstract 


Astrometric surveys such as Gaia and LSST will measure parallaxes for hundreds of millions of stars. 
Yet they will not measure a single distance. Rather, a distance must be estimated from a parallax. In 
this didactic article, I show that doing this is not trivial once the fractional parallax error is larger than 
about 20%, which will be the case for about 80% of stars in the Gaia catalogue. Estimating distances is an 
inference problem in which the use of prior assumptions is unavoidable. I investigate the properties and 
performance of various priors and examine their implications. A supposed uninformative uniform prior 
in distance is shown to give very poor distance estimates (large bias and variance). Any prior with a 
sharp cut-off at some distance has similar problems. The choice of prior depends on the information one 
has available - and is willing to use - concerning, for example, the survey and the Galaxy. I demonstrate 
that a simple prior which decreases asymptotically to zero at infinite distance has good performance, 
accommodates non-positive parallaxes, and does not require a bias correction. 


1 Introduction 


The parallax of an object is its observed angular displacement with respect to a reference frame due to 
the movement of the observer over a baseline. Stellar parallaxes are measured using twice the Earth- 
Sun separation as a baseline. From geometry and the definition of the parallax, tu, the distance of a star 
from the Sun, r pc, is 1 jw arcsec to a very good approximation. 

Parallaxes are one of the few distance measures in astronomy which do not make assumptions about 
the intrinsic properties of the object. Hipparcos blazed the trail when it measured the parallaxes of over 
10® stars to an accuracy of around 1 mas down to G ~ 11 in the 1990s (Perryman et al. 1997). Gaia is 
currently extending this to 10® objects with expected accuracies as good as 0.01 mas (and still much less 
than a mas at its magnitude limit of G = 20) (Lmdegren et al. 2008), and LSST hopes to obtain parallaxes 
with standard deviations of order 1 mas for billions of stars down to much fainter magnitudes (Ivezic et 
al. 2011). 

It is therefore of considerable importance to know how to estimate distances (and their uncertainties) 
from parallaxes. Despite the simple relation between the two, inverting a parallax to give a distance 
is only appropriate when we have no measurement errors. As we always have measurement errors, 
determining the distance given a parallax becomes an inference problem. Here I show that this inference 
is not trivial when the errors are even still quite small, necessarily involves assumptions, and can lead 
to surprisingly large errors. I then set out to solve the following problem: given a measured parallax, w, 
and its uncertainty, <7^^, how can we obtain a good estimate of the distance and its uncertainty? 

Numerous studies in the astronomical literature have examined estimating physical quantities from 
parallaxes and possibly additional data (such as colours). These quantities maybe individual distances, 
absolute magnitudes, distances to clusters, etc. Example studies include Lutz & Kelker 1973, Smith & 
Eichhorn 1996, Brown et al. 1997, Verbiest et al. 2012, Palmer et al. 2014. My objective is not to repeat 
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such analyses, but rather to illustrate the issues involved using what is arguably the simplest of these 
problems: estimating distance from a single parallax. More complex problems will need to consider 
these same issues, even though in many cases we will not want to estimate distances explicitly. We 
should instead infer the quantity of interest directly. If we want to compare model predictions with 
data (which is not the topic of this paper), then it is usually better to project the predictions (and their 
distributions) into the data space, where the noise (measurement) model is better defined, rafher fhan 
inferring quanfities and then trying to determine an appropriate noise model to use in the comparison. 


1.1 Definitions 

I use w to denote parallax (always in arcsec) and r to denote distance (always in pc). Where it is neces¬ 
sary to make a distinction, I use rtme to indicate the true distance, and rest to indicate an estimate of this, 
such as the mode r^node, or median, rmed-1 will frequently refer to the fractional parallax error, the ratio of 
the (Icr) parallax rmcertainty to the parallax. This can either be the empirical quantity based only on the 
measured data, / = u^jw, or a quantity based on the true distance, /true = Uro?'true- In a real application 
rtrue and /true are of course rmknown. 


2 First steps 

2.1 Measurement model (likelihood) 

For a star at true distance r, its true but unknown parallax is l/r. The measured parallax, w, is a noisy 
measurement of 1 /r. I will assume that vd is normally distributed with unknown mean l/r and known 
standard deviation 0-^ • That is, I assume vd has been drawn from the distribution 

P{vD\r,a^) = —exp 

V 27r(j^ 

which is Gaussian in w, but of course not in r. This frmction is the measurement model, or likelihood. 
It gives the probability density frmction (PDF) - probability per unit parallax - for any w, given values 
of r and 

This is the measurement model used in the Hipparcos and Gaia data processing. 0-^ depends in par¬ 
ticular on the centroiding accuracy of each of the individual position measurements which are used in 
the astrometric solution. This in turn depends primarily on the number of photons (N) received from 
the star (ct^ — 1/ VN), and thus on its brightness, the observing time per observation, and the number 
of observations (assuming the observations have an appropriate cadence to determine a parallax at all). 
Depending on the source brightness, other terms and systematic errors may also be significant. The two 
most important points here are (1) is independent of vj, and (2) can be estimated from another 
measurement model using other measured data (de Bruijne 2012)lJ 

Equation [T] has a finite probability of giving non-posifive parallaxes, and fhis probability gets larger 
with increasing /true- An astrometric reduction can indeed produce a negative parallax, because the 
"measured" parallax is the result of reducing a set of noisy angular measurements. It corresponds to 
the parallactic displacement being in the opposite direction on the sky from that expected due to the 
movement of the observer along the baseline. It does not correspond to a negative distance, because 
r > 0 by definition. Non-positive parallaxes are perfectly reasonable measurements, and we will see 
that they can deliver useful distance information. 

Strictly speaking the likelihood must be changed at extremely small distances, because then the defi¬ 
nition that the (noise-free) parallax is the reciprocal of distance breaks down. But this only occurs for 
parallaxes on degree scales, whereas all measured stellar parallaxes are less than an arcsecond. 


'As this measurement model is an approximation and uses noisy measurements of the star's brightness, it will not return the 
true value of . But this will generally be a small uncertainty compared to the noise in the parallax, so 1 don't consider it here. 


2a2 


zu — 


where (7^ > 0 


( 1 ) 
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2.2 Intuition 


Suppose we have a measurement 

One may be tempted to report 1 jw as the distance estimate, and use a first order Taylor expansion to 
give as the uncertainty. But we will now see that this quickly becomes a poor approximation. 

From the definition of the Gaussian, the intervals 

1/r = [tn —2CTro,n7] and l/r = \pD^w (2) 

each includes a fraction 0.954/2 = 0.477 of the total probability of the distribution P(vo \ r, CTro). The 
transformation from l/r to r is monotonic and so preserves the probability. Thus the intervals 

r = [l/w, l/(n7 — 2crro)] and r = \Vl[w p2a^\\lw\ (3) 

must each also contain a fraction 0.477 of the total probability over the distance. But whereas these 
intervals are equally sized in l/r (the Gaussian is symmetric), they are not in r. For example, with vj = 
0.1 and Uct = 0.02, these intervals are r = [10,16.67] and r = [7.14,10] respectively. The uncertainties do 
not transform symmetrically because of the nonlinear transformation from l/r to r. We see that the first 
order Taylor expansion suggested above is a poor approximation even for relatively small errors (here 
/=l/5). 

But what if the errors are larger, with / = 1/2? The upper distance interval in the example becomes 
r = [1/w, oo]. If / is even larger, then this interval becomes undefined and we seem to "lose" some of 
the probability. As the Gaussian distribution has infinite support for all values of w and <7^/ some finite 
amount of probability in the likelihood function will always correspond to an undefined distance. 

The problem here is that we are trying to make a probabilistic statment about r using just equation [TJ 
yet this is a probability distribution over w, not r. The solution is to pose the problem correctly. 


3 The inference problem 


Given w and a^, we want to infer r. As there is noise involved, we know we carmot infer r exactly, so 
we want a probability distribution over the possible values of r. That is, we want to find P{r \ w, 
From Bayes theorem (which follows from the axioms of probability) this is related to the likelihood by 

P{r\w,a^) = ^P{w\r,a^)P{r) (4) 

where Z is the normalization constant 

nr—CO 

z = f Pivj I r, CTro) P{r) dr (5) 

Jr=0 

which is not a function of r. P{r) is the prior, and expresses our knowledge of - or assumptions about 
- the distance, independent of the parallax we have measured. P{r \ vj, a^) is the posterior distribution. 
By multiplying the likelihood by a prior, we essentially transform an expression (the likelihood) for the 
probability of the known data (parallax) given the unknown parameter (distance) to an expression (the 
posterior) for the probability of the parameter given the data. 

We see that we can only infer the distance in a probabilistic sense if we adopt a prior. Some people object 
to this on philosophical grounds {How can science depend on assumptions?), others on practical grounds 
{How can I knoiv the prior if I haven't yet measured any distances?). The latter is a valid objection and will be 
discussed later. Yet without a prior, we run into the problems we just saw in the previous section. 


4 An improper uniform prior 


A common strategy for dealing with prior discomfort is to adopt a imiform prior over the full range of 
the parameter, on the grounds that this does not prefer one value over another. In the present context 
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Figure 1: The unnormalized posterior P;* (r | tu, (improper uniform prior) for ru = 1/100 and four 
values of / = (0.1,0.2,0.5,1.0). The posteriors have been scaled to all have their mode at Pj* (r | U7, Uro) = 
1. As the prior is improper, the posterior remains finite out to infinite distance for all values of / > 0. 


we should go one step further and use 




1 if r > 0 
0 otherwise 


( 6 ) 


so as to introduce the definition that distances must be positive. The * symbol is used to indicate that 
the PDF is rmnormalized. The subscript is an abbreviation of the distribution. Because it extends to 
infinity, this prior cannot be normalized. Such priors are referred to as improper. From equation 31 the 
(urmormalized) posterior, Pj* (r | U7, cr^j), in this case is the likelihood (equation3]l but now considered 
as a function of r rather than vu, and subject to the additional constraint that r must be positive, i.e. 


Piui'r\'cu,(7^) = 


P{zu I r, (Tro) if r > 0 
0 otherwise. 


(7) 


Examples of this posterior are shown in Figure 3] for vu = 1/100 and various values of /. This demon¬ 
strates the skewness discussed in section IZ2l 

Inspection of equation 31 shows that 

lim P;*(r|n7,CTro) = const . (8) 

r—^■oo 

The posterior does not converge, has an infinite area and so carmot be normalized. Consequently it 
has no mean, no standard deviation, no median, and no quantiles. The only plausible estimator of the 
distance is the mod^E which we see from Figure 31 is well defined for all values of /, and is equal to 
Xjw for w > 0. For non-positive parallaxes the posterior increases monotonically from 0 at r = 0 to 
a finite asymptote (equation 3]l putting the mode at r = oo, which is physically implausible in a finite 
universe. This is bad, as negative parallaxes are valid measurements. We should also be concerned that 
this estimator ignores the parallax uncertainty, even though we have seen that its magnitude determines 
the skewness of the distance distribution (section l2.2b . An uncertainty could be derived from the full 
width at half maximum (FWHM) (or similar), but without a normalizable posterior there is no useful 
probabilistic interpretation of this. 


^ As the prior is uniform here, the maximum of the posterior equals the maximum of the likelihood when the latter is expressed 
as a function of r. 
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4.1 Empirical test 


The above mentioned problems notwithstanding, we can still ask how good the mode of this posterior 
is as an estimator. Recall that the posterior is a function of the measured parallax, which due to noise 
is not equal to the true parallax, so 1/w \s not equal to the true distance. Thus to assess the quality of 
any estimator we need to test it using noisy simulated data. In particular, we would like to see how 
its accuracy varies as a function of the expected fractional parallax error. I do this with the following 
empirical test procedure: 


1. assign a fractional parallax error, /true 

2. assign a true distance, rtrue (which together with /true determines 

3. simulate a parallax measurement, w, by drawing a value at random from the likelihood with 

— ‘^true 

4. use w in the posterior to evaluate the distance estimator, rest 

5. calculate the scaled residual, x = (rest — ?'true)/?'true 

6. repeat steps 2-5 numerous times to compile a set of values {xi} at a single value of /true 

7. calculate the bias, &(/true) = x, and sample standard deviation, s(/true) = ~ 

8. repeat steps 1-7 for different values of /true- 


The bias and standard deviation are standard tests of the quality of an estimator. A good estimator will 
have small values of both, where "small" is, of course, a relative term. By using the scaled residuals 
(rather than just the residuals) I can collate information on many different true distances into a single 
value of b and s for a given /true- 


In the first test I assign true distances in step IHby drawing at random from the distribution 


P^2{r) 



if 0 < r < riim 
otherwise 


(9) 


which corresponds to stars being uniformly distributed in three-dimensional space, PiV) = constH I use 
?'iim = 10^- For each value of /true I sample 10® true distances. I do this for 100 values of /true ranging 
from 0.01 to 1.0 in steps of 0.01. The estimator in this case is the mode, which is just l/vu. If the parallax 
is non-positive (in step |3), then for the improper uniform prior I throw away the sample, because an 
infinite distance would give an infinite bias and standard deviation (telling us straight away that it is a 
bad estimator!). The procedure for other priors I will discuss as we encounter them. 

The results are shown in Figure|2l We see that the standard deviation increases steadily until /true — 0.22, 
then it explodes. The bias also increases sharply, in this case reaching a value of about 4 at /true = 1- The 
reason for this appearance is that as /true increases, the probability increases that the measured parallax - 
which is drawn from the likelihood - becomes arbitrarily close to zero, in which case rest = 1 Irxi becomes 
arbitrarily large. This also explains why both the bias and standard deviation become very noisy for 
/true ^ 0.22: the exact appearance of the curves depends on the sample drawn in the simulation!! 


Although noisy, the plot is independent of rum, because we are calculating functions of the scaled resid¬ 
uals, X, for fixed /true- Doubling rum would double both rtrue and rest, leaving x unchanged. 


One might argue that by drawing true distances from a distribution which is very different from the 
uniform prior adopted in the posterior, we are bound to get poor results. To some extent this is true. 
To test this, I repeat the above procedure, but drawing from the prior in equation Of course, to be 
able to draw from this I have to set an upper limit (otherwise all draws will be infinite), and I again use 

®For a constant volume density the probability, P(V)dV, of finding a star in a shell with inner and outer radii (r, r + dr) is 
proportional to the volume of that shell, i-Kr'^dr. P{V)dV = P(r)dr, so P{r) oc r^. 

^The definition of the point at which the standard deviation, and especially the bias, "explodes", is somewhat arbitrary, but 
it appears to be reasonably stable for sufficiently large samples. Furthermore, repeating the experiment with larger samples, it 
seems that the point at which the standard deviation "explodes" converges to /true — 0.20. Below this value we get a very smooth 
dependence of bias and standard deviation on /true- At values of /true well below this "explosion" we could even use the Taylor 
approximation mentioned in section lZ2l to estimate the distance and a (symmetric) uncertainty. 
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Figure 2: The bias (black) and standard deviation (blue) as a function of /true for the mode estimator 
of the posterior P*^{r \ w, a-^j) (equation[71 which uses the improper uniform prior) for dafa drawn from 
the constant volume density prior (equation^. The right panel is a zoom of the left panel. The red line 
in the right panel shows the fraction of samples which had to be rejected because they had non-positive 
parallaxes. 
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Figure 3: As Figure|2j but now drawing data from fhe fruncafed uniform prior. 


riim = 10^ (alfhough the exact value is irrelevant, because the scaled residuals are independent of it). 
The results are shown in Figure |31 We see that they are hardly any better than before. The problem is 
fherefore not prior mismatch, but the fact that 1/w (the mode) is a very noisy estimator once ftrue ^ 0.22. 
Because this posterior is not normalizable, no other estimator is available. 

This example shows that an apparently innocuous approach to inference is in fact highly problematic. 
It is the kind of fhing which may be done by those who reject the Bayesian approach to inference on the 
grounds that it depends on prior assumptions. They want to rely only on the likelihood, yet have to 
adopt some kind of prior to get logically consistent answers, and tacitly choose an unbounded uniform 
prior in r in the belief fhat if is "uninformative". Yef one can equally argue fhat uniform in log r (i.e. 
in distance modulus) is uninformative. In fact, by adopting a uniform prior in r, one assumes fhat the 
volume density of sfars varies as P(V) ^ 1/r^ (see fhe discussion around equation|9), which is highly 
informatively 


^Uniform in logr corresponds to uniform in 1/r and so P(V) ~ 1/r®. 
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Figure 4: Left: the unnormalized posterior P*{r \ w, cr^) (truncated rmiform prior with rnm = 10^) for 
w = 1/100 and four values of / = (0.1,0.2,0.5,1.0) (black lines). The red line shows the posterior for 
w = —1/100 and |/| = 0.25. The posteriors have been scaled to all have their mode at at P *(r | w, 0-^) = 
1. Right: the same five posterior PFDs but now normalized. 


Clearly, this approach is not very robust: it will only give distance estimates for /true ^ 0.22, it cannot 
give self-consistent uncertainty estimates (only a poor approximation using a Taylor expansion), and it 
breaks down entirely with non-positive parallaxes. Can we do better? 


5 A proper uniform prior 


An obvious improvement is to truncate the prior at some value. This prevents the inferred distance 
becoming very large, so should reduce the explosion of the bias and standard deviation we saw with 
the improper prior for larger /true- The normalized prior is 


-Pu(r) 


- if 0 < r < riim 

riim 

0 otherwise. 


( 10 ) 


where rum is the largest distance we expect for any star in our survey. The posterior is the same shape 
as in Figure [T]but set to zero for r > riim 


P*{r\w,a^) 


-^P{w\r,a^) 

riim 

0 


if 0 < r < riim 
otherwise . 


( 11 ) 


This is shown in Figure H) Note the effect of the normalization^ The right panel illustrates how the 
posterior is a combination of the likelihood and prior. When the data are good (/true is small), the 
likelihood is narrow compared to the prior, so the former dominates the posterior. As the data become 
less informative (larger /true)/ the posterior gets broader and the prior plays more of a role. For non¬ 
positive parallaxes the posterior increases monotonically from 0 at r = 0 to a maximum at r = runi (the 
red line shows an example). 


^This and all subsequent posteriors have no simple solutions for their integrals, so the integrals have been evaluated numeri¬ 
cally for the plots by sampling on a fine grid. 
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Figure 5: The bias (black) and standard deviation (blue) as a function of /true for the mode estimator 
(equation [T2I1 of the posterior Pu{r\w,an,) (equation [TTJ which uses the truncated uniform prior) for 
data drawn from a fruncated uniform prior in r. The right panel is a zoom of fhe leff panel. The solid 
red line shows the fraction of samples which were rejected because they had non-positive parallaxes. 
The dashed red line shows the fraction of samples with an extreme mode (1/vj > riim) which was 
truncated to nim. 


The mode of the resulting posterior serves as our distance estimator, and is 


1 

if 

1 



0 < — < 

I’lim 

w 


w 


Him 

if 

1 

— > nim 

(extreme mode) 



W 

1'lim 

if 

0 

VI 

B 



( 12 ) 


I call a mode solution "extreme" if the estimate from fhe likelihood alone would place it beyond riim, 
and so is truncated by the prior to rum. That non-postitive parallaxes are set to rum is consistent with the 
nature of the parallax measurement (see section l2Tll . 


I redo the empirical test with data drawn from the uniform prior (with a common value of rum). To 
allow a fairer comparison with the results obtained with the improper uniform prior, I again reject 
simulated parallax measurements in step |3] which are non-positive. The results are shown in Figure |5j 
They are again independent of riim. The bias and standard deviation are significantly reduced. This 
is not surprising, because we are now clipping all those large distance estimates to riim. The dashed 
line shows that relatively few objects have their distances clipped in this way - only 10% at /true = 0.25 
and 20% at /true = 1 - which in turn shows that the explosion in the bias and standard deviation seen 
previously was due to a minority of objects. An alternative to clipping these distances would be to reject 
them, which means we decide we cannot estimate distances. The results of fhe empirical fesf in fhat case 
(for the same data) are shown in Figure |6l As we would expect, rejecting these cases reduces the bias 
and standard deviation, but not by much. 

Compared to the improper uniform prior at least, the results look quite good. In Figure]^ the bias and 
standard deviation increase approximately linearly for /true < 0.25 wifh gradients of 0.10 and 1.1 re¬ 
spectively; fheir values are 0.025 and 0.29 respectively af /true = 0.25. One mighf think about correcting 
the bias (see section|8ll, but the need for fhis will be later obviated by the use of a betfer prior. 

Now fhat the posterior is normalized we could investigate using the mean or median. But these are 
strongly influenced by fhe choice of rnm for large values of /true/ which is just the domain where we 
want a more robust estimator than the mode. 


The results in Figures|5]and[5]are for the situation when the real data come from fhe same disfribufion as 
fhe prior. The shape of fhis prior aside, individual distance estimates for small parallaxes and/or large 
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Figure 6: As Figure|5l but now rejecting all samples which have 1 jvj > rum- 


values of /true will be sensitive to the choice of riim. Astrophysically this corresponds to knowing the 
maximum distance of any star. This could be set for each star individually according to its measured 
apparent magnitude, m, and the fact that stellar evolution limits the absolute magnitude, Miim, to about 
—5 mag. The flux continuity relation then tells us that 5 log^Q rum = m — Mu^ — Am + 5, where Am is the 
interstellar absorption in the observed band. Setting Am = 0 gives us the maximum distance (although 
this will be an overestimate for observations at low Galactic latitudes). A more stringent upper limit on 
Miim could be set if the star's colour were known. 

Truncating the prior has helped, but it still corresponds to assuming that the volume density of stars 
varies as P{V) ^ 1/r^. This is not only a strong prior, it is also scale independent, which the Galactic 
stellar distribution most certainly is not. 


6 A constant volume density prior 


One could adopt a more complex prior based on a Galaxy model. We may want to do this in practice 
(see section ID, but this gives us little insight into the role of the prior. Let us instead take at least an as¬ 
trophysically reasonable yet less informative prior, namely a constant volume density of stars (equation 
0. This too is truncated at r = rum in order to be normalizable. The unnormalized posterior is 


P*2{r \uj,a^) 


r 

— exp 



if 0 < r < riim 


0 


otherwise . 


(13) 


Examples of this posterior are shown in Figure[7|for U7 = 1/100, riim = 10^, and various values of /. For 
low / the posterior looks unimodal, but then appears to develop a minimum (and thus a second mode 
at r = riim), until for / > 0.35 the lower mode (and hence the minimum) disappears, leaving a posterior 
which increases monotonically from zero up to rum. 


We can find the turning points of the posterior by solving dP*2{r \ vu,a.^)jdr = 0 for r. This gives a 
quadratic equation with solutions 

rmin,rmode = (l ± “ 8/^) • (14) 

The solution with the positive sign is the minmum (rmin) and that with the negative sign is the mode 
(rmode)- They are plotted as a function of / in Figure |51 There is an additional mode at r = Him due to 
the discontinuous cut-off in the prior. We see that for / > 1 /VS (= 0.354) neither turning point is defined: 
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Figure 7: Left: the unnormalized posterior P *2 (r | w, a^) (truncated constant volume density prior with 
Him = 10^) for tx7 = 1/100 and eight values of / = (0.1,0.2,0.25,0.3,0.34,0.36,0.5,1.0). The posteriors 
have been scaled to all have their mode at at P *2 (r | zu, = 1- Right: the same posterior PDFs but now 
normalized. The curves with the clear maxima around r = 100 are / = (0.1,0.2,0.25,0.3) in decreasing 
order of the height of the maximum. 




Figure 8: The value of the roots (equation[T4| of the posterior P *2 (r | zu, (equation[T3l which uses the 
constant volume density prior) as a function of /. The two roots are the minimum (rmin/ left) and the 
mode (rmode/ right), zu is the measured parallax. 


the posterior transitions from having two modes and a minimum to the single mode at r = rum. Thus 
for larger values of / we would need to resort to another estimator; rum itself would be consistent. 

Equation[T4]shows that the mode scales linearly with l/w: the function in the right panel of Figure|8]is 
the factor which we multiply 1 /nj by to estimate the distance. It depends on the parallax error (through 
/),but not on Him. 

Contrary to expectations (perhaps), the posterior would have a minimum for all (positive) values of 
/ < l/\/8, but if it occurs at r > rum it will not be seen due to the cut off from the prior. In the case 
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Figure 9: The normalized posterior Pj.2 (r | zu, (truncated constant volume density prior with rum = 
10^) for w = —1/100 and four values of |/| = (0.1,0.2,0.5,1.0) (black lines). The red line shows the 
posterior for zu = 0 for ^ l/nim (/ is then undefined). 


shown r\i^w = 10, so the minimum will only drop below rum when / > 0.2120 The minimum goes to 
infinity in the limit / —> 0. 

The posterior is also defined when zu < 0, as can be seen by inspection of equation [13] and shown in 
Figure O This demonstrates that when we use a prior, negative or zero parallaxes provide information 
in agreement with our intuition: they are just noisy measurements which happened to end up non¬ 
positive. They imply the distance is likely to be large (up to the constraint imposed by prior knowledge), 
and the range of probable solutions depends on the size of the parallax error (a smaller error means more 
constraint). This intuition is expressed quantitatively by the probabilistic approach. 


Let us now test this estimator with the same empirical test as used before (section [4]T)|. I draw the data 
from the constant volume density prior (equation[9]( with the same value of rum as used in the posterior 
(10^). My distance estimator is the lower mode of the posterior when it is defined, i.e. rmode in equation 
[THif / < 1/and w > 0, and rnm otherwise. Note that this decision can be made using only measured 
data (it involves /, not /true)- If ^mode > rum - an extreme mode - then I set rest = rum, because this is 
what the prior tells us. Thus 


^est 


rmode 

if 

w > 

riim 

if 

w > 

riim 

if 

w > 

riim 

if 

VI 


0 and / < 1/Vi 
0 and / < l/Vi 
0 and / > 1/Vi 
0 


and Tmode ^ riirn 
and Tmode ^ rijru 
(undefined mode) 


(extreme mode) 


(16) 


The results are shown in Figure [TOl Note that the curve is defined for all positive value of /true (the 
horizontal axis), in particular beyond 1/v^. At any given value of /true the samples will have a range of 
values of / due to the parallax sampling in step |3] of the empirical test. But these still contribute to the 
sample for the specific value of /true- 

We see in the figure how the fraction of samples with an undefined mode (dotted red line) increases 
steadily with /true- The majority of the samples fall in this category once /true > 0.35. This largely 
explains why the bias and standard deviation saturate: setting most distances to r\i^ limits the distance 
error incurred. The fraction of samples with an extreme mode (dashed red line) initially increases, but 


^This comes from solving the positive sign solution of eauationll4lfor / (> 0) with r-min = rum- The solution is 
/ = 


2riini^ 


riim-ro 


(15) 
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Figure 10: The bias (black) and standard deviation (blue) as a function of /true for the mode estimator 
(equation [T^ of the posterior P* 2 {r \vj,aj 2 j) (constant volume density prior; equations [131 and [Mt for 
data drawn from the same prior. The right panel is a zoom of the left panel. The dashed red line shows 
the fraction of samples with an extreme mode {r^ode > ?’iim)- The dotted red line shows the fraction of 
samples with an undefined mode (/ > 1/\/8). In both these cases the distance estimates for the samples 
are set to nim. 




Figure 11: As Figure [TOl but now the samples with extreme and undefined modes are rejected rather 
than set to riim. 


then decreases, because for large parallax errors fhe measured parallax is increasingly likely eifher to be 
negative (which is not considered an extreme mode) or positive but so small that / > l/\/8, in which 
case the mode is undefined and the sample contributes to the dotted curve instead. 

Using this prior and strategy, we get much better results than with the improper uniform prior (Figure 
0. We also achieve a lower standard deviation than with the trimcated uniform prior (Figure |5) for 
larger values of ftme, but at the price of a much larger bias. Note that these results are for data drawn 
from the same prior as that used in the posterior. With a mismatched prior we will get different results. 

Instead of truncating the extreme and undefined mode values to rum, we could just reject them, as we 
did with the truncated uniform prior in Figure The results of this are shown in Figure [11] The dotted 
and dashed lines are the same (they are statistically the same data), but now the standard deviation and 
especially the bias look quite different. Indeed, the bias now turns negative for some values of /true- This 
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is because we are rejecting exclusively samples which would otherwise be assigned the largest distance 
possible. This also helps to lower the standard deviation. But once /true increases nearly to 1 (100% 
expected parallax errors), the standard deviation and bias are large again. Even worse, we have had to 
reject over 95% of the data. 

While a constant volume density prior is more desirable from a physical point of view than the prior 
uniform in r, it suffers from the same problem: a discontinuous cut-off in rest. For lower accuracy 
parallaxes, this demands that we either throw away those data or assign a fixed distance, rnm. Both are 
undesirable. 


7 An exponentially decreasing volume density prior 


We would like to have a posterior which yields a distance estimate for any value of / and zu, including 
non-positive parallaxes. This can be achieved by replacing the sharp cut-off of the previous priors with 
something which drops asymptotically to zero as r —oo. Here I investigate a prior which produces an 
exponential decrease in the volume density of stars, P{V) ^ exp(—r/L), namely 


P,2e-r(r) 



j.2g-r/L 


if r > 0 
otherwise 


(17) 


where L > 0 is a length scale. The unnormalized posterior is 

-exp 

P/2e-r(r |U7, cr^) = < 

io 



if r > 0 
otherwise . 


(18) 


Examples of this posterior are shown in Eigure [121 We see a dependence on / which is similar to that 
of the prior for small r, but now with a smooth peak at large r and a continuous decline beyond. 
Depending on the value of /, we may have one or two modes. In this example there is a single mode for 
0 < / < 0.30, two modes for 0.30 < / 0.373, and one mode for / > 0.373. Note that although L is a 

characteristic length scale of the posterior, the mode corresponding to this is at a somewhat larger value 
of r. 


The larger /, the less informative the data, and in the limit f ^ oo the posterior just becomes the prior 
for any value of the parallax (positive, zero, or negative). Indeed, already at / = 1 the posterior is almost 
indistinguishable from the prior (the green line). 

The red line in Eigure (121 shows the posterior for a negative parallax of —1/100 with |/| = 0.25. If |/| 
were smaller this posterior would shift to the right. This makes sense, because a smaller | /1 means we 
are more confident that the true parallax is close to zero. As |/| gets larger the posterior shifts to the left, 
eventually converging to the prior. 


To find the mode we set dP* 2 g-r (?■ | m, u-^^jdr = 0 which gives 

— - 2r^ -I- —r- 5 - = ^ ■ 

L oL (tL 


(19) 


This is a cubic equation which generally has three complex roots, but fewer solutions here due to the 
physical limitations on the signs of the variables. The roots are a function not only of / = a^jvo, but 
also of L and Depending on these values, the posterior may have three real roots, corresponding to 
two modes and one minimum, or just one real root, corresponding to a single mode. 


Inspection of the roots leads to the following strategy for assigning the distance estimator, rmode/ from 
the modes: 


• If there is one real root, it is a maximum: select this as the mode. 

• If there are three real roots, there are two maxima: 
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Figure 12: The black lines in the left panel show the unnormalized posterior P* 2 ^-rir \ zu,(7^) (expo¬ 
nentially decreasing volume density prior; equation [181 for L = 10^, zu — 1/100 and seven values of 
/ = (0.1,0.2,0.29,0.31,0.33,0.5,1.0). The red line is the posterior for tn = —1/100 and |/| = 0.25. The 
green curve is the prior. The right panel is a zoom of the left one and also shows an additional posterior 
for / = 0.36. All curves have been scaled to have their highest mode at P *2 _r(r | ru, a^) = 1 (outside 
the range for some curves in the right panel). 




Figure 13: The distance estimator mode, rmode/ of the posterior P* 2 g-r {r \vj,a^) (which uses the expo¬ 
nentially decreasing volume density prior; equation [18) shown as rmode^^ as a function of /true- Each 
line is for a different value of the parallax, w, and labelled with log^o ^7- The left panel is for L = 10^ and 
the right panel for L = 10^. Note the (different) log scales on the vertical axes. 


- If ra7 > 0, select the smallest root (value of r) as the mode. 

- If ra7 < 0, select the mode with r > 0 (there is only one). 

The other two possibilities (zero or two real roots) do not occur for > 0, L > 0. 

The variation of Tmode as a function of / for different vj and L is shown in Figure 113^1 Let us follow 
the curve labelled "—2" {zu = 1/100) in the left panel, which corresponds to the posteriors plotted in 
Figure (121 At small values of /true (below 0.30), the posterior has a single mode, and the value of rmode 
(for a given zu) increases slowly and smoothly with increasing fractional parallax error. As /true rises 

*For a given L and ro, the variation of rj? vi^ith respect to /true is independent of ro. 
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Figure 14: The bias (black line) and standard deviation (blue line) as a function of /true for the mode dis¬ 
tance estimator of the posterior P *2 _r(r | w, a^) (exponentially decreasing volume density prior; equa¬ 
tion [T^l with L — 10^ for data drawn from the same prior. The black circles and blue triangles are the 
bias and standard deviation respectively of the median of the posterior. The right panel is a zoom of the 
left panel. 


above about 0.30, a second mode appears, but I continue to use the mode at the lower value of r as the 
distance estimator, because we can think of this one as being dominated by the data: it continues to 
evolve smoothly from the data-dominated regime (small /true)- Once /true rises above 0.373, there is a 
sudden increase in the value of rmode- This corresponds to the "data-dominated" mode disappearing, 
leaving only the other,"prior-dominated", mode for all larger values of /true- 

If the measured parallax were smaller, say logj^g zu = —3, but L the same, then we see from Figure [131 
(left panel) that the posterior only ever has one mode for all /true- This is because the data and the 
prior are now indicating distances on a similar order of magnitude. If we instead used a larger L for 
logto ^ (right panel, line labelled "—3"), the measured parallax is again quite different from the 

prior, so we again see two modes for smaller ftrue, and the transition to a single mode for larger ftrue- 
We only and always get such transitions when vaL ^ 1. Whenever we have two modes, it is always 
the smaller one which is dictated by the data, so we can always make the correct choice of distance 
estimator. 

To assess the performance of this posterior and estimator, I perform the empirical test drawing data 
from the same prior (equation ll7ll as used in this posterior, both with L = 10^. The results are shown in 
Figure [lH The variation of the standard deviation is similar to what we have seen before with proper 
priors, and similar in scale to that obtained with the prior. The standard deviation increases linearly 
when /true < 0.25 with a gradient of 1.0, slightly better than the prior in Figured and now without 
having to reject any data. But in the present case we get essentially no bias even for large /. This lack 
of bias was not seen with the previous priors, even when we had a match between the prior and the 
distribution of true distances. 

The explanation for this is that we are now using a prior which decreases continuously and asymp¬ 
totically after some distance. The particular choice of exp(—r/L) is not important to achieve this. The 
problem with using a sharp cut-off in a prior is that a star with a sufficiently large fractional parallax 
error will, before applying the cut-off, have a significant probability for distances greater than the cut¬ 
off distance. If the mode lies beyond this value, applying the cut-off will cause a "build up" of inferred 
distances at the cut-off, resulting in a strong bias. This can occur even for stars with very large paral¬ 
laxes (small distances): not only stars with a true distance near to the cut-off distance are affected. If the 
fractional parallax error is large enough, the sharp cut-off will always cause a problem, and increasing 
the cut-off distance will not help. This shows that tnmcating the prior at the largest distance expected 
based on the observed magnitude will not avoid the bias. One might argue that tnmcating the prior at 
very large distances will have little impact in practice, but with an asymptotically decreasing prior this 
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Figure 15: As FigureO but now drawing data from the prior with rum = 10^ (and the left panel now 
has a different vertical scale). 


is even urmecessary, because at r ^ L there is vanishingly small probability anyway. 

Returning to Figure [131 we may be tempted to conclude that using L = l/w would be a good idea, 
because it would give rmode tu — 1. But the parallax is a noisy measurement, sow ^ 1/rtrue- We are not 
aiming to get 1 jw as our estimator (we saw in earlier sections how bad it is). The empirical test confirms 
that using L = l/ro in the prior gives poor results. 

So far 1 have only investigated the mode of this posterior. Finding the mode is generally easy, because 
it involves differentiation and solving a polynomial equation. But it is not necessarily the best estimator 
in terms of bias and variance. Finding the mean and median (or any other quantile) involves integrals 
(semi-infinite and finite) of the form r"P^* 2 g_,(r | over r for n = 1 (and n = 2 for the standard 

deviation, if we wanted this as an error estimate). These integrals have no simple solution, so we must 
resort to a numerical approach. Where possible I use adaptive quadrature techniques. For some combi¬ 
nations of parameters this approach can find no solution, in which case I use the Metropolis algorithm]^ 
This is slow compared to adaptive quadrature, so 1 have only computed the bias and standard deviation 
of the median at a few values of /true- These are plotted in Figure |T4| as black circles and blue triangles 
(respectively). The median has a significant bias for larger /true- The mean has a similar profile, but with 
slightly larger values of both bias and standard deviation. So of all the obvious estimators, the mode 
appears to be the best. 

What happens if we have a mismatch between the distribution the data are drawn from, and the prior 
assumed in the model? Figure (151 shows an example of this, where data are drawn from the trrmcated 
constant volume density prior (P(r) oc r^) with rum = 10^. As the posterior extends to considerably 
larger distances than the maximum true distance of any star in the sample, we inevitably overestimate 
distances when /true is large: The estimation of course only sees the noisy measured parallax, which 
for large /true has a high probability of being much smaller than 1/rtrue- If we repeat this experiment 
but now drawing data from a prior which extends to larger distances, e.g. rnm = 10'^, then the posterior 
estimator will tend to underestimate distances. There is not much more you can do about this other than 
to try to construct a well-matched prior and to report confidence intervals on your distance estimates by 
calculating quantiles on the posterior: I suggest 5% and 95%. This interval will be large for large / 
(see the left panel of Figure [T2ll . but large fractional parallax errors necessarily mean that our distance 
estimates are poor and that we are dependent on the prior. Yet this is arguably preferable to throwing 
away data once / rises above some arbitrary value. 

While this prior overcomes many of the problems we saw with other priors, I am not suggesting that we 

^After some trial and error I found that reasonably good results could be obtained using a step size equal to the mode of the 
posterior and initialized at the mode, with a chain length of 10® samples and a short burn-in. A longer chain would reduce the 
noise, but takes longer to compute. 


16 





always use it. The point of the examples in the previous few sections was to show that the choice of prior 
(and esfimafor) can have a significant impact on the inferred disfances, and that some of the obvious 
choices may be poor. One should always think about what is an appropriate prior, given the available 
information. This is discussed further in section|9l The main application of the exponentially decreasing 
volume density prior will be when we have very little information other than the parallax, and/or want 
to make minimal assumptions. 


8 A bias correction? 


Some of the literature on astrometry has been concerned with trying to apply bias corrections to using 
Xjw as a distance estimator (e.g. Lutz & Kelker 1973, Smith 2003, Francis 2013). Unfortunately, many 
of fhese corrections are only valid under very restricted circumstances, only apply to small (< 0.2) 
fractional parallax errors, or are even wrong (Brown et al. 1997, Brown 2012). Furthermore, we have just 
seen that there are priors which give unbiased estimates, so do we need to ever make bias corrections? 
If there is a mismatch between the prior and the true distribution the stars are drawn from, then we 
might. But this requires that we know what the mismatch is, in which case we could improve the prior. 
It seems far betfer fo always improve the prior than to make ad hoc corrections. 

Suppose we did have a desirable estimator which is possibly biased (e.g. one which gives a lower stan¬ 
dard deviation than an unbiased estimator). Is there a useful correction which could be applied which 
does not simultaneously increase the standard deviation by too much? This is an issue, because any bias 
correction is itself a funcfion of measured and therefore noisy dafal^ The bias and standard deviation 
plots I have shown, plotted against /true (= Urortrue)/ show the bias and standard deviation we expect to 
obtain on average as a function of the expected fractional parallax accuracy. They are useful fo predict 
the quality of an estimator. But they cannot be used to calculate a bias correction, because they depend 
on an unknown quantity. We would instead have to plot the bias against / (= arnjw). This could be 
done, but it is questionable whether the use of a suifable prior will ever bring up fhe need to make a bias 
correction. 


9 The choice of prior and estimator 


There are many more priors and posterior-based estimators one could think of. A good prior is one 
which is as closely matched to the expected data as possible. In practice it should contain all the relevant 
information we have which is independent of individual measuremenfs. This could be a combination of 
both the expected intrinsic distribution of stars in the Galaxy, how they enter the sample (as influenced 
by the choice of observational wavelength and interstellar extinction), and how they are selected by the 
survey (e.g. the survey detection efficiency and magnitude limit). This could be quite complex, and will 
generally vary with the direction of observation through the Galaxy. The exponentially decreasing vol¬ 
ume density prior (section[7ll may be suitable when looking well out of fhe disk, where for a sufficiently 
deep survey the decrease in stellar density is caused mostly by the Galaxy itself rather than the survey. 
There are of course many variafions on the theme of this prior. For example, we could produce a sharper 
decrease with increasing r by using P*{r) = exp(—(r/L)“) for a > l^j 

Although stellar physics implies a maximum distance for any star, we have seen that discontinuities in 
the prior have the unpleasant property of leading to rejection of data or the cumulation of all poor data 
at the same maximum distance. For smooth priors, arbitrarily large distances receive asymptotically 
small probabilities, so are preferred. If the survey includes extragalactic objects, one will want to allow 
for very large (but not infinite) distances. This could be achieved using a two component prior (one for 
Galactic, one for exfragalactic objects) with two modes. When the posterior has two modes, one should 
generally report both, but may choose only to report one if it is significantly higher than the other. 

Setting an informative prior which is wrong could be disastrous. When faced with a choice, I recom- 

Any "correction" which is independent of the data can be built into the model (e.g. the prior) so is no longer a correction. 

In that case, the equation for the turning points is eouationllglbut with /L replaced by (a/L“)r“+T 
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Figure 16: Cumulative distribution of fractional parallaxes errors on a log scale (left) and a linear scale 
(right). The black line is the expected fractional parallax error (i.e. /true) for Gaia, calculated using the 
GUMS catalogue with the post-launch, sky-averaged, 5-year mission astrometric accuracy model. The 
red line is the measured fractional parallax error (i.e. /) for the Hipparcos catalogue. 


mend using the simplest prior possible consistent with the constraints, as its influence on the results will 
be easier to interpret. As with all cases of data analysis and interpretation - whether Bayesian or not - 
results will depend on personal choices. A good approach is to test the sensitivity of the results to the 
use of differenf but equally acceptable priors. The poor quality data (high /) are more affected by the 
choice of prior, and fhese will anyway have broader posteriors. Thus it is imperative that confidence 
intervals are reported on every distance estimate. I suggest to use 5% and 95% quantiles to define a 
90% confidence interval. The standard deviation, a, should not be used, because r is strictly positive. 
Quoting r ± (T implies a Gaussian and negative distances. 

Given a posterior, one is still left with the choice of estimator. For the priors I have tested, the mode 
(taking care to deal with multimodality as described) is more accurate (lower bias and variance) and 
faster to calculate (no numerical integration) than the mean or median. 


10 Hipparcos and Gaia 


We have seen that when the fractional parallax errors are larger than about 20%, use of a prior is im¬ 
perative in order to make reasonable distance estimates at all and to assign confidence limits. Simply 
rejecting data with larger errors is in the best case a waste of hard-earned survey dafa, and in the worst 
case can severely bias analyses, because we will tend to reject fainter and/or more distant stars. While 
the accuracy of the ongoing Gaia survey will surpass all previous surveys in parallax accuracy on a 
large number of sfars (accuracy of order 0.02 mas at 15*^ magnitude), it remains that many of the stars 
in the Gaia catalogue will have large fractional parallax errors. Using a published simulation of whaf 
Gaia is expected to observe (Robin et al. 2012) together with the first post-launch estimates of Gaia's 
end-of-mission, sky-averaged parallax accuracy (de Bruijne ef al. 2015), I have calculated the fraction of 
stars with expected fractional parallax errors (/true) less than some amount: see the black line in Figure 
[l6l Only about 20% of fhe Gaia catalogue will have /true < 0.2. This leaves at least 800 million stars 
requiring use of sensible prior information if we are to obtain a meaningful and not totally deviant dis¬ 
tance estimate from the astrometry. The situation is better with the Hipparcos catalogue (red line Figure 
M, yet even here only about 40% of fhe stars have / < 0.2. 

For this reason, we should make use of astrophysical parameters which will be estimated from the 
Gaia spectrophotometry and spectroscopy (Bailer-Jones et al. 2013) to improve our distance estimates. 
Specifically, the interstellar extinction and absolute magnitude, combined with our knowledge of sfellar 
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astrophysics embodied in the Hertzsprung-Russell diagram, give an independent, probabilistic deter¬ 
mination of the distance. This posterior PDF can be combined with the astrometric-based posterior PDF 
to yield a distance estimate which is more accurate than either estimate alone. 


11 Summary and conclusions 


I have shown that estimating the distance given a parallax is not trivial. The naive approach of reporting 
l/zu ± (Tro/n7^ fails for non-positive parallaxes, is extremely noisy for fractional parallax errors larger 
than about 20%, and gives an incorrect (symmetric) error estimate. Probability-based inference for the 
general case is unavoidable. Adopting an "uninformative" improper uniform prior over all positive r 
does not solve any of these problems, and is in fact both informative and implausible (viz. a volume 
density of stars varying as 1/r^). The problems can be avoided by using a properly normalized prior 
which necessarily decreases after some distance. The use of a prior wifh a sharp cuf-off is not recom¬ 
mended, because it introduces significant biases for low accuracy measurements at all distances (not 
just those near the cut-off). Instead, a prior which converges asymptotically to zero as distance goes to 
infinity should be used. Perhaps the simplest case is P*{r) = exp {—r/L), which corresponds to an 
exponentially decreasing volume density with a length scale L. It is relatively neutral in that it is broad, 
decreases slowly, and has only one parameter. The mode of the corresponding posterior is an unbiased 
estimator (for the case that the population observed conforms to the prior) and it provides meaning¬ 
ful estimates for arbitrarily large parallax errors as well as non-positive parallaxes. Bias corrections are 
not necessary. The variance of this estimator behaves as well as one could expect given the irreducible 
measurement errors. The median and mean perform less well fhan the mode. 

Distance estimates must be accompanied by uncertainty estimates. For fractional parallax errors larger 
than about 20%, the posterior over distance is significantly asymmetric, so we should always report 
confidence intervals using quantiles (e.g. 5% and 95%) rather than standard deviations. 

The performance of any distance estimator depends, of course, on how well the prior is matched to the 
true distribution of distances. One must consider how much one is willing to tune the prior to previous 
knowledge of fhe Galaxy. In any case, priors with discontinuities should be avoided, and the sensitivity 
of results to the adopted priors should always be tested and reported. Limiting inference to objects with 
fractional parallax errors much better than 20% significantly diminishes the dependence of resulfs on fhe 
choice of prior. Buf this would limit one to using only 20% of the Gaia catalogue, which may introduce 
truncation biases into subsequent astrophysical analyses. 

Its appealing properties notwithstanding, I am not advocating exclusive use of the exponentially de¬ 
creasing volume density prior. My main message is rather that one should think carefully abouf what is 
an appropriate prior, given the available information, and what impact this has on the results. 
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